Fixes #25666: Add retry-with-backoff to getServiceStatus for transient failure recovery by RajdeepKushwaha5 · Pull Request #27112 · open-metadata/OpenMetadata

RajdeepKushwaha5 · 2026-04-07T02:51:11Z

Describe your changes:

Problem: PipelineServiceClient.getServiceStatus() delegates to getServiceStatusInternal() with zero retry logic. A single transient REST failure (e.g., selector manager closed during an Airflow restart or network blip) immediately returns an unhealthy response (code 500), causing OpenMetadata to mark the Airflow agent as UNAVAILABLE. The agent never auto-recovers — a manual UI refresh or service restart is required.

Root Cause: While a getServiceStatusBackoff() method with Resilience4j retry already existed in PipelineServiceClient, it was never called by any production code. Both callers (IngestionPipelineResource.getRESTStatus() and SystemRepository.getPipelineServiceClientValidation()) call getServiceStatus() directly, which had no retry.

Fix: Move the Resilience4j retry-with-backoff logic into getServiceStatus() itself, so every health check benefits from transient failure tolerance:

- getServiceStatus() now retries up to 3 attempts with 5-second backoff intervals when the response code is not 200
- getServiceStatusBackoff() is simplified to delegate to getServiceStatus() (avoids nested/double retry)
- Removed the unused Supplier import

Test: Added getServiceStatusRecoversFromTransientFailure test that enqueues a transient 500 followed by a healthy 200 and verifies the final status is healthy with exactly 3 requests (detection + failed attempt + successful retry).
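The fix and test described above can be sketched in plain Java. This is only an illustration of the retry-with-fixed-wait idea and of the "transient 500 followed by 200" scenario, not the PR's code: the PR uses Resilience4j, whereas `StatusResponse`, `callWithRetry`, and `FakeService` below are hypothetical names invented for this sketch, and the version-detection request mentioned in the test description is not modeled.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Hypothetical stand-in for PipelineServiceClientResponse; only the status code matters here. */
final class StatusResponse {
  final int code;
  final String reason;

  StatusResponse(int code, String reason) {
    this.code = code;
    this.reason = reason;
  }
}

final class StatusRetrySketch {
  /**
   * Calls the supplier up to maxAttempts times, sleeping waitMillis between attempts.
   * Returns the first result for which shouldRetry is false, or the last result once
   * the attempts run out.
   */
  static <T> T callWithRetry(
      Supplier<T> call, Predicate<T> shouldRetry, int maxAttempts, long waitMillis) {
    T result = call.get();
    for (int attempt = 1; attempt < maxAttempts && shouldRetry.test(result); attempt++) {
      try {
        Thread.sleep(waitMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt and stop retrying
        return result;
      }
      result = call.get();
    }
    return result;
  }

  /** Fake service: returns canned responses in order, repeating the last one when exhausted. */
  static final class FakeService implements Supplier<StatusResponse> {
    private final Deque<StatusResponse> responses;
    int calls = 0;

    FakeService(List<StatusResponse> canned) {
      this.responses = new ArrayDeque<>(canned);
    }

    @Override
    public StatusResponse get() {
      calls++;
      return responses.size() > 1 ? responses.poll() : responses.peek();
    }
  }

  public static void main(String[] args) {
    FakeService service = new FakeService(List.of(
        new StatusResponse(500, "selector manager closed"),
        new StatusResponse(200, "healthy")));
    // Retry on anything other than 200, as the description states; a 10 ms wait keeps the demo fast.
    StatusResponse result = callWithRetry(service, r -> r == null || r.code != 200, 3, 10);
    System.out.println(result.code + " after " + service.calls + " calls");
  }
}
```

In this sketch the health check itself is called twice (one failed attempt plus one successful retry); the third request counted in the PR's test would be the separate detection call.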

Type of change:

Checklist:

I have read the CONTRIBUTING document.
My PR title is Fixes <issue-number>: <short explanation>
I have commented on my code, particularly in hard-to-understand areas.
For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

I have added a test that covers the exact scenario we are fixing. For complex issues, comment the issue number in the test for future reference.

Summary by Gitar

Refactored configuration handling:
- Introduced getStringParam and getIntParam helpers in AirflowRESTClient to safely parse configuration parameters.
- Added explicit validation for username and password in the constructor, throwing PipelineServiceClientException if missing.
Improved constructor robustness:
- Added support for various timeout types and handled null configurations or missing additionalProperties maps gracefully.
Expanded test coverage:
- Added comprehensive tests for parameter handling, including non-numeric timeouts and missing credentials scenarios.
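As a rough illustration of what such parameter helpers might look like, here is a minimal sketch. The method names `getStringParam` and `getIntParam` come from the summary above, but their signatures, the `Map<String, Object>` parameter type, and the choice to fall back to a default on a non-numeric value are all assumptions made for this example, not the PR's actual implementation.

```java
import java.util.Map;

final class ConfigParamSketch {
  /** Returns the trimmed string value for key, or defaultValue if it is absent or blank. */
  static String getStringParam(Map<String, Object> params, String key, String defaultValue) {
    if (params == null) {
      return defaultValue; // tolerate a missing additionalProperties-style map
    }
    Object value = params.get(key);
    if (value == null) {
      return defaultValue;
    }
    String text = value.toString().trim();
    return text.isEmpty() ? defaultValue : text;
  }

  /** Accepts numeric values or numeric strings; anything else yields defaultValue. */
  static int getIntParam(Map<String, Object> params, String key, int defaultValue) {
    if (params == null) {
      return defaultValue;
    }
    Object value = params.get(key);
    if (value instanceof Number) {
      return ((Number) value).intValue();
    }
    if (value instanceof String) {
      try {
        return Integer.parseInt(((String) value).trim());
      } catch (NumberFormatException e) {
        return defaultValue; // assumption: non-numeric timeouts fall back rather than fail
      }
    }
    return defaultValue;
  }
}
```

Whether a bad value should fall back silently or raise an error is a design choice; mandatory settings such as credentials are better rejected explicitly, as the summary describes for the constructor.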

This will update automatically on new commits.

github-actions · 2026-04-07T02:51:45Z

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Copilot

Pull request overview

This PR improves ingestion agent health-check resilience by adding Resilience4j retry-with-backoff to PipelineServiceClient.getServiceStatus(), so transient pipeline-service failures (e.g., during Airflow restarts) don’t immediately mark the agent as unavailable.

Changes:

- Moved retry-with-backoff logic into PipelineServiceClient.getServiceStatus() and simplified getServiceStatusBackoff() to delegate.
- Updated retry to operate on PipelineServiceClientResponse (retrying when HTTP code is non-200).
- Added an Airflow REST client test ensuring a transient 500 followed by 200 recovers successfully and issues the expected number of requests.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| openmetadata-spec/src/main/java/org/openmetadata/service/clients/pipeline/PipelineServiceClient.java | Centralizes retry-with-backoff in `getServiceStatus()` so all callers benefit from transient failure tolerance. |
| openmetadata-service/src/test/java/org/openmetadata/service/clients/pipeline/airflow/AirflowRESTClientTest.java | Adds regression test verifying recovery from a transient health-check failure. |

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Copilot · 2026-04-19T18:31:20Z

  public PipelineServiceClientResponse getServiceStatus() {
-    if (pipelineServiceClientEnabled) {
-      return getServiceStatusInternal();
+    if (!pipelineServiceClientEnabled) {
+      return buildHealthyStatus(DISABLED_STATUS).withPlatform(DISABLED_STATUS);
+    }
+    PipelineServiceClientResponse response =
+        retryForServiceStatus().executeSupplier(this::getServiceStatusInternal);


PR description says getServiceStatus() now retries when the response code is “not 200”, but the implementation only retries on null or code >= 500. Please align the PR description with the actual behavior, or adjust the retry predicate if the intended behavior is to retry all non-200 responses.
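To make the difference between the two behaviors concrete, the sketch below compares the two retry predicates the comment mentions. It is an illustration only: the names `NON_200` and `NULL_OR_5XX` are hypothetical, and a `null` code is used here to model a missing response.

```java
import java.util.function.Predicate;

final class RetryPredicateSketch {
  // Retry on every response that is not a 200 (what the PR description states).
  static final Predicate<Integer> NON_200 = code -> code == null || code != 200;

  // Retry only on a missing response or a server error (what the reviewer reports the code does).
  static final Predicate<Integer> NULL_OR_5XX = code -> code == null || code >= 500;

  public static void main(String[] args) {
    Integer[] codes = {200, 401, 404, 500, 503, null};
    for (Integer code : codes) {
      System.out.println(
          code + ": non-200=" + NON_200.test(code) + ", null-or-5xx=" + NULL_OR_5XX.test(code));
    }
  }
}
```

The two predicates disagree only on non-200 responses below 500, such as 401 or 404. Retrying those is usually wasted effort, since a client error will not fix itself between attempts, which is one reason to prefer the narrower predicate and update the description instead.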

gitar-bot · 2026-04-20T08:43:11Z

Code Review ✅ Approved 10 resolved / 10 findings

Implements robust retry-with-backoff logic for service status checks, resolving multiple NPEs, configuration flaws, and incorrect exception handling. All identified issues were addressed, ensuring stable and reliable recovery from transient failures.

✅ 10 resolved

✅ Bug: GetEntityVersionsTool calls non-existent CommonUtils.parseIntParam()

📄 openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/GetEntityVersionsTool.java:51
At line 51, CommonUtils.parseIntParam(params.get("limit"), DEFAULT_LIMIT) is called, but the CommonUtils class in the MCP tools package does not define a parseIntParam() method. This will cause a compilation error.

✅ Bug: CompareEntityVersionsTool NPE when pojoToJson returns null

📄 openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/CompareEntityVersionsTool.java:115-117
JsonUtils.pojoToJson(null) returns null. At line 117, fromJson.equals(toJson) will throw a NullPointerException if either fromVal or toVal is null (e.g., a field not present in one version). This is a reachable condition since version comparisons commonly involve fields that were added or removed.

✅ Bug: CompareEntityVersionsTool hardcodes table-specific fields

📄 openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/CompareEntityVersionsTool.java:97-100
The computeDifferences method compares a hardcoded list of fields (columns, tableConstraints, tableType, etc.) that are specific to Table entities. When comparing other entity types (Pipeline, Dashboard, Topic, etc.), this tool will miss all entity-specific fields and only compare the few common ones (description, owners, tags, displayName). The tool's name and parameters suggest it is generic.

✅ Edge Case: MCP tools NPE when exception message is null

📄 openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/CompareEntityVersionsTool.java:90 📄 openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/GetEntityVersionsTool.java:72
Both CompareEntityVersionsTool (line 90) and GetEntityVersionsTool (line 72) catch all exceptions and call Map.of("error", e.getMessage()). Map.of() does not permit null values, so if e.getMessage() is null (common for NPE, ClassCastException, etc.), this will throw a NullPointerException, masking the original error.

✅ Quality: PR includes unrelated files (notes, scripts, issue templates)

📄 TOP_10_CONTRIBUTIONS.txt 📄 TOP_10_CONTRIBUTIONS_V2.txt 📄 issue-assignment-requests.md 📄 ISSUE_DESCRIPTION.md 📄 ISSUE_DESCRIPTION_RESOURCE_LEAKS.md 📄 ISSUE_SUBJECT_CONTEXT_SILENT_CATCHES.md 📄 PR_DESCRIPTION.md 📄 PR_DESCRIPTION_AIRFLOW_RETRY.md 📄 PR_DESCRIPTION_RESOURCE_LEAKS.md 📄 scripts/fix_basic.py 📄 openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/CompareEntityVersionsTool.java 📄 openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/GetEntityVersionsTool.java
The PR includes 12 files that are unrelated to the stated fix (retry-with-backoff for getServiceStatus): TOP_10_CONTRIBUTIONS.txt, TOP_10_CONTRIBUTIONS_V2.txt, issue-assignment-requests.md, ISSUE_DESCRIPTION.md, ISSUE_DESCRIPTION_RESOURCE_LEAKS.md, ISSUE_SUBJECT_CONTEXT_SILENT_CATCHES.md, PR_DESCRIPTION*.md, scripts/fix_basic.py, and the two MCP tool files. These appear to be personal working notes and unrelated feature additions that should be in separate PRs. The fix_basic.py script contains a hardcoded Windows path (D:\OpenMetadata\...).

...and 5 more resolved from earlier reviews
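The two NullPointerException findings above follow a common pattern, which can be sketched as below. This is a minimal illustration rather than the code that resolved the findings: `sameJson` and `errorResult` are hypothetical helper names, and falling back to the exception's class name when the message is null is an assumption.

```java
import java.util.Map;
import java.util.Objects;

final class NullSafetySketch {
  /** Null-safe comparison: two nulls are equal, and a null never equals a non-null value. */
  static boolean sameJson(String fromJson, String toJson) {
    return Objects.equals(fromJson, toJson);
  }

  /** Builds an error payload without ever passing a null value to Map.of. */
  static Map<String, Object> errorResult(Exception e) {
    String message = e.getMessage() != null ? e.getMessage() : e.getClass().getSimpleName();
    return Map.of("error", message);
  }

  public static void main(String[] args) {
    System.out.println(sameJson(null, "{}"));
    System.out.println(errorResult(new NullPointerException()));
    System.out.println(errorResult(new IllegalStateException("bad state")));
  }
}
```

`Objects.equals` avoids calling `.equals` on a null reference, and substituting a non-null message keeps `Map.of`, which rejects null keys and values, from masking the original error.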

sonarqubecloud · 2026-04-20T09:42:59Z

Quality Gate passed for 'open-metadata-ingestion'

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code


Copilot AI review requested due to automatic review settings April 7, 2026 02:51

Copilot started reviewing on behalf of RajdeepKushwaha5 April 7, 2026 02:51

Copilot AI reviewed Apr 7, 2026

RajdeepKushwaha5 force-pushed the fix/25666-airflow-status-retry-on-transient-errors branch from 68791c8 to 9336512 April 7, 2026 03:30

gitar-bot reviewed Apr 7, 2026
Comment thread (outdated): openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/GetEntityVersionsTool.java

gitar-bot reviewed Apr 7, 2026
Comment thread (outdated): openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/CompareEntityVersionsTool.java

gitar-bot reviewed Apr 7, 2026
Comment thread (outdated): openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/CompareEntityVersionsTool.java

gitar-bot reviewed Apr 7, 2026
Comment thread (outdated): openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/CompareEntityVersionsTool.java

Copilot AI review requested due to automatic review settings April 7, 2026 04:54

RajdeepKushwaha5 force-pushed the fix/25666-airflow-status-retry-on-transient-errors branch from 9336512 to f32ab33 April 7, 2026 04:54

Copilot started reviewing on behalf of RajdeepKushwaha5 April 7, 2026 04:55

gitar-bot reviewed Apr 7, 2026
Comment thread (outdated): ...data-spec/src/main/java/org/openmetadata/service/clients/pipeline/PipelineServiceClient.java

Copilot AI reviewed Apr 7, 2026
Comment thread (outdated): ...data-spec/src/main/java/org/openmetadata/service/clients/pipeline/PipelineServiceClient.java
Comment thread (outdated): ...data-spec/src/main/java/org/openmetadata/service/clients/pipeline/PipelineServiceClient.java

RajdeepKushwaha5 force-pushed the fix/25666-airflow-status-retry-on-transient-errors branch from f32ab33 to 0d023dc April 7, 2026 05:38

gitar-bot reviewed Apr 7, 2026
Comment thread (outdated): ...data-spec/src/main/java/org/openmetadata/service/clients/pipeline/PipelineServiceClient.java

Copilot AI review requested due to automatic review settings April 7, 2026 05:49

RajdeepKushwaha5 force-pushed the fix/25666-airflow-status-retry-on-transient-errors branch from 0d023dc to 7a38087 April 7, 2026 05:49

Copilot started reviewing on behalf of RajdeepKushwaha5 April 7, 2026 05:50

Copilot AI reviewed Apr 7, 2026
Comment thread (outdated): ...data-spec/src/main/java/org/openmetadata/service/clients/pipeline/PipelineServiceClient.java

RajdeepKushwaha5 force-pushed the fix/25666-airflow-status-retry-on-transient-errors branch from 7a38087 to a6412ae April 7, 2026 06:00

gitar-bot reviewed Apr 7, 2026
Comment thread (outdated): ...data-spec/src/main/java/org/openmetadata/service/clients/pipeline/PipelineServiceClient.java

Copilot AI review requested due to automatic review settings April 7, 2026 06:12

RajdeepKushwaha5 force-pushed the fix/25666-airflow-status-retry-on-transient-errors branch from a6412ae to 2676e07 April 7, 2026 06:12

RajdeepKushwaha5 temporarily deployed to test April 8, 2026 00:50 with GitHub Actions (Inactive)

Commit 0b675f3: Merge branch 'main' into fix/25666-airflow-status-retry-on-transient-errors

RajdeepKushwaha5 requested a review from Copilot April 12, 2026 04:39

Copilot started reviewing on behalf of RajdeepKushwaha5 April 12, 2026 04:39

Copilot AI reviewed Apr 12, 2026

RajdeepKushwaha5 temporarily deployed to test April 12, 2026 04:47 with GitHub Actions (Inactive)

RajdeepKushwaha5 and others added 2 commits April 19, 2026 23:55
Commit 0686407: Merge upstream/main and resolve conflicts in AirflowRESTClientTest
Commit 11381b3: Merge branch 'main' into fix/25666-airflow-status-retry-on-transient-errors

Copilot AI review requested due to automatic review settings April 19, 2026 18:26

Copilot started reviewing on behalf of RajdeepKushwaha5 April 19, 2026 18:27

Copilot AI reviewed Apr 19, 2026

RajdeepKushwaha5 temporarily deployed to test April 19, 2026 18:37 with GitHub Actions (Inactive)

RajdeepKushwaha5 had a problem deploying to test April 19, 2026 18:37 with GitHub Actions (Failure)

RajdeepKushwaha5 temporarily deployed to test April 19, 2026 18:37 with GitHub Actions (Inactive)

Commit 9a5042c: Merge branch 'main' into fix/25666-airflow-status-retry-on-transient-errors

RajdeepKushwaha5 temporarily deployed to test April 20, 2026 08:52 with GitHub Actions (Inactive)
